Introduction to the dataset¶

Insurance Dataset Analysis Report

Overview

This report presents an analysis of a synthetic insurance dataset using visualization and statistical tools to uncover patterns and relationships between key variables. The dataset is simulated based on real-world data to ensure privacy while maintaining realistic patterns for analysis.

Objective

The primary objectives of this analysis are:

Hypothesis 1: To assess whether the frequency of claims affects the premium amount.

Hypothesis 2: To determine whether a policyholder's credit score has a direct impact on the premium amount.

To provide insights that may enhance risk assessment and inform decision-making for insurance providers.

Dataset Summary

Size: 10,000 rows × 27 columns

Missing Values: None

Age Range: 18 to 90 years

Average Age: Approximately 40 years

Initial Observations

The age distribution is right-skewed rather than normal, owing to the high volume of young policyholders. The most frequent age is 18, with 822 policyholders. Most policyholders have no claims, and very few have more than three.

Claim Severity Distribution

Low Severity Accidents: 70.0%

Medium Severity Accidents: 20.4%

High Severity Accidents: 9.6%

Premium Amount Analysis

The premium value distribution is approximately normal, centered between USD 2,100 and USD 2,400.

There is a clear pattern showing that premium values increase with claim frequency, which aligns with typical insurance pricing models.

Conclusion

The findings suggest that:

There is a positive relationship between claim frequency and premium amount (correlation of about 0.36) and an inverse relationship between credit score and premium amount (correlation of about -0.25), suggesting that higher credit scores are associated with lower premiums.

General overview of the dataset¶

In [58]:
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("C:/Users/aov_f/Downloads/Data Science/Diploma in Data Analytics - Fitzwilliam Institute/Final project/synthetic_insurance_data.csv")
print(df.shape)
df.describe()
(10000, 27)
Out[58]:
Age Is_Senior Married_Premium_Discount Prior_Insurance_Premium_Adjustment Claims_Frequency Claims_Adjustment Policy_Adjustment Premium_Amount Safe_Driver_Discount Multi_Policy_Discount ... Total_Discounts Time_Since_First_Contact Conversion_Status Website_Visits Inquiries Quotes_Requested Time_to_Conversion Credit_Score Premium_Adjustment_Credit Premium_Adjustment_Region
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 ... 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.00000 10000.000000 10000.000000 10000.000000
mean 39.991700 0.159300 42.131400 47.625000 0.497200 36.780000 -79.860000 2219.571400 0.199900 0.305100 ... 30.110000 15.478000 0.576700 5.022900 1.996900 1.996900 46.07320 714.253400 -11.320000 64.325000
std 14.050358 0.365974 42.993376 34.354438 0.716131 65.910288 97.955806 148.521132 0.399945 0.460473 ... 33.689782 8.677975 0.494107 2.238231 1.415588 0.817409 45.44845 49.749487 48.704156 39.232618
min 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -200.000000 1800.000000 0.000000 0.000000 ... 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.00000 530.000000 -50.000000 0.000000
25% 29.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -200.000000 2100.000000 0.000000 0.000000 ... 0.000000 8.000000 0.000000 3.000000 1.000000 1.000000 6.00000 681.000000 -50.000000 50.000000
50% 39.000000 0.000000 0.000000 50.000000 0.000000 0.000000 0.000000 2236.000000 0.000000 0.000000 ... 50.000000 16.000000 1.000000 5.000000 2.000000 2.000000 12.00000 715.000000 -50.000000 50.000000
75% 50.000000 0.000000 86.000000 50.000000 1.000000 50.000000 0.000000 2336.000000 0.000000 1.000000 ... 50.000000 23.000000 1.000000 6.000000 3.000000 3.000000 99.00000 748.000000 50.000000 100.000000
max 90.000000 1.000000 86.000000 100.000000 5.000000 800.000000 0.000000 2936.000000 1.000000 1.000000 ... 150.000000 30.000000 1.000000 16.000000 9.000000 3.000000 99.00000 850.000000 50.000000 100.000000

8 rows × 21 columns

First five rows¶

In [61]:
df.head(5)
Out[61]:
Age Is_Senior Marital_Status Married_Premium_Discount Prior_Insurance Prior_Insurance_Premium_Adjustment Claims_Frequency Claims_Severity Claims_Adjustment Policy_Type ... Time_Since_First_Contact Conversion_Status Website_Visits Inquiries Quotes_Requested Time_to_Conversion Credit_Score Premium_Adjustment_Credit Region Premium_Adjustment_Region
0 47 0 Married 86 1-5 years 50 0 Low 0 Full Coverage ... 10 0 5 1 2 99 704 -50 Suburban 50
1 37 0 Married 86 1-5 years 50 0 Low 0 Full Coverage ... 22 0 5 1 2 99 726 -50 Urban 100
2 49 0 Married 86 1-5 years 50 1 Low 50 Full Coverage ... 28 0 4 4 1 99 772 -50 Urban 100
3 62 1 Married 86 >5 years 0 1 Low 50 Full Coverage ... 4 1 6 2 2 2 809 -50 Urban 100
4 36 0 Single 0 >5 years 0 2 Low 100 Full Coverage ... 14 1 8 4 2 10 662 50 Suburban 50

5 rows × 27 columns

In [63]:
print(df.columns.tolist())
['Age', 'Is_Senior', 'Marital_Status', 'Married_Premium_Discount', 'Prior_Insurance', 'Prior_Insurance_Premium_Adjustment', 'Claims_Frequency', 'Claims_Severity', 'Claims_Adjustment', 'Policy_Type', 'Policy_Adjustment', 'Premium_Amount', 'Safe_Driver_Discount', 'Multi_Policy_Discount', 'Bundling_Discount', 'Total_Discounts', 'Source_of_Lead', 'Time_Since_First_Contact', 'Conversion_Status', 'Website_Visits', 'Inquiries', 'Quotes_Requested', 'Time_to_Conversion', 'Credit_Score', 'Premium_Adjustment_Credit', 'Region', 'Premium_Adjustment_Region']

Checking for missing values¶

In [66]:
df.isnull().sum()
Out[66]:
Age                                   0
Is_Senior                             0
Marital_Status                        0
Married_Premium_Discount              0
Prior_Insurance                       0
Prior_Insurance_Premium_Adjustment    0
Claims_Frequency                      0
Claims_Severity                       0
Claims_Adjustment                     0
Policy_Type                           0
Policy_Adjustment                     0
Premium_Amount                        0
Safe_Driver_Discount                  0
Multi_Policy_Discount                 0
Bundling_Discount                     0
Total_Discounts                       0
Source_of_Lead                        0
Time_Since_First_Contact              0
Conversion_Status                     0
Website_Visits                        0
Inquiries                             0
Quotes_Requested                      0
Time_to_Conversion                    0
Credit_Score                          0
Premium_Adjustment_Credit             0
Region                                0
Premium_Adjustment_Region             0
dtype: int64
In [68]:
df = df.dropna()

Checking data type¶

In [71]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 27 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   Age                                 10000 non-null  int64 
 1   Is_Senior                           10000 non-null  int64 
 2   Marital_Status                      10000 non-null  object
 3   Married_Premium_Discount            10000 non-null  int64 
 4   Prior_Insurance                     10000 non-null  object
 5   Prior_Insurance_Premium_Adjustment  10000 non-null  int64 
 6   Claims_Frequency                    10000 non-null  int64 
 7   Claims_Severity                     10000 non-null  object
 8   Claims_Adjustment                   10000 non-null  int64 
 9   Policy_Type                         10000 non-null  object
 10  Policy_Adjustment                   10000 non-null  int64 
 11  Premium_Amount                      10000 non-null  int64 
 12  Safe_Driver_Discount                10000 non-null  int64 
 13  Multi_Policy_Discount               10000 non-null  int64 
 14  Bundling_Discount                   10000 non-null  int64 
 15  Total_Discounts                     10000 non-null  int64 
 16  Source_of_Lead                      10000 non-null  object
 17  Time_Since_First_Contact            10000 non-null  int64 
 18  Conversion_Status                   10000 non-null  int64 
 19  Website_Visits                      10000 non-null  int64 
 20  Inquiries                           10000 non-null  int64 
 21  Quotes_Requested                    10000 non-null  int64 
 22  Time_to_Conversion                  10000 non-null  int64 
 23  Credit_Score                        10000 non-null  int64 
 24  Premium_Adjustment_Credit           10000 non-null  int64 
 25  Region                              10000 non-null  object
 26  Premium_Adjustment_Region           10000 non-null  int64 
dtypes: int64(21), object(6)
memory usage: 2.1+ MB

Table showing Regional split¶

In [74]:
# Count policyholders per region; to_frame names the count column explicitly
table = df['Region'].value_counts().to_frame(name='count')
print(table)
          count
Region         
Urban      4921
Suburban   3023
Rural      2056

Regional distribution¶

In [77]:
df['Region'] = df['Region'].astype('category')
fig, ax = plt.subplots(figsize=(10, 6))  
df['Region'].value_counts().plot(kind='bar', ax=ax, color='skyblue')
ax.set_xlabel('Region')
ax.set_ylabel('Count')
ax.set_title('Regional distribution')
plt.grid(True)
plt.show()
[Figure: bar chart "Regional distribution" (Count by Region)]
In [159]:
# Rebuild the regional counts table for the pie chart
table = df['Region'].value_counts().to_frame(name='count')

ax = table.plot.pie(
    autopct='%1.1f%%',
    figsize=(8, 8),
    title='Regional distribution - Pie Chart', 
    subplots = True
)
plt.show()
[Figure: pie chart "Regional distribution - Pie Chart"]

Histogram of Age distribution¶

In [80]:
sns.histplot(df.Age, kde=True,
             bins=int(180/5),
             color='darkblue',
             edgecolor='black',
             linewidth=1)
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
[Figure: histogram with KDE "Age distribution" (Frequency by Age)]
In [82]:
mode = df['Age'].mode()
print(mode)
0    18
Name: Age, dtype: int64
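
The mode alone does not show how many policyholders share that age or how skewed the distribution is. A minimal sketch of how both could be checked, using a small made-up age series rather than the real file; the helper name `describe_shape` is illustrative and not part of the notebook:

```python
import pandas as pd

def describe_shape(values: pd.Series) -> dict:
    """Summarise the mode, its count, and the sample skewness of a series."""
    counts = values.value_counts()  # sorted by count, most frequent first
    return {
        "mode": counts.index[0],
        "mode_count": int(counts.iloc[0]),
        "skewness": float(values.skew()),  # > 0 indicates a longer right tail
    }

# Toy ages (not the real data): many young policyholders, few older ones.
ages = pd.Series([18] * 5 + [25] * 3 + [40] * 2 + [70])
summary = describe_shape(ages)
print(summary)
```

Applied to `df['Age']`, a positive skewness would support the right-skew noted in the initial observations, and `mode_count` would give the number of policyholders aged 18.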

Correlation of Variables - Pair Plots (EDA Analysis)¶

In [85]:
# Select only numeric columns
numeric_df = df.select_dtypes(include='number')

# Plot scatter matrix
scatter_matrix(numeric_df, figsize=(12, 12), diagonal='kde', alpha=0.5)
plt.suptitle("Pairwise Scatterplots (Pandas)", y=1.02)
plt.show()
[Figure: scatter matrix "Pairwise Scatterplots (Pandas)"]
In [87]:
numeric_df = df.select_dtypes(include='number')
sns.pairplot(numeric_df, diag_kind='kde', plot_kws={'alpha': 0.3, 's': 10})
plt.suptitle("Pairwise Scatterplots (Seaborn)", y=1.02)
plt.show()
[Figure: pair plot "Pairwise Scatterplots (Seaborn)"]
In [89]:
correlation_matrix = df.corr(numeric_only=True)

# Display it
print(correlation_matrix)
                                         Age  Is_Senior  \
Age                                 1.000000   0.694873   
Is_Senior                           0.694873   1.000000   
Married_Premium_Discount           -0.010954   0.003055   
Prior_Insurance_Premium_Adjustment -0.112349  -0.050245   
Claims_Frequency                   -0.005683  -0.010700   
Claims_Adjustment                  -0.007991  -0.012606   
Policy_Adjustment                   0.006552   0.007301   
Premium_Amount                     -0.029541  -0.016341   
Safe_Driver_Discount               -0.001894   0.001749   
Multi_Policy_Discount               0.005307   0.003546   
Bundling_Discount                   0.019296   0.019519   
Total_Discounts                     0.010986   0.012044   
Time_Since_First_Contact            0.012382   0.008079   
Conversion_Status                   0.010889   0.014002   
Website_Visits                     -0.026851  -0.013611   
Inquiries                           0.004740   0.002691   
Quotes_Requested                    0.013878   0.010677   
Time_to_Conversion                 -0.011568  -0.013851   
Credit_Score                        0.002005  -0.009078   
Premium_Adjustment_Credit          -0.000525   0.003270   
Premium_Adjustment_Region           0.005704  -0.011979   

                                    Married_Premium_Discount  \
Age                                                -0.010954   
Is_Senior                                           0.003055   
Married_Premium_Discount                            1.000000   
Prior_Insurance_Premium_Adjustment                  0.002825   
Claims_Frequency                                    0.014028   
Claims_Adjustment                                   0.015524   
Policy_Adjustment                                  -0.006465   
Premium_Amount                                      0.291593   
Safe_Driver_Discount                                0.008348   
Multi_Policy_Discount                              -0.011159   
Bundling_Discount                                   0.002578   
Total_Discounts                                    -0.001537   
Time_Since_First_Contact                            0.010795   
Conversion_Status                                  -0.019536   
Website_Visits                                      0.022060   
Inquiries                                           0.005821   
Quotes_Requested                                    0.005185   
Time_to_Conversion                                  0.018586   
Credit_Score                                       -0.013903   
Premium_Adjustment_Credit                           0.007010   
Premium_Adjustment_Region                          -0.014421   

                                    Prior_Insurance_Premium_Adjustment  \
Age                                                          -0.112349   
Is_Senior                                                    -0.050245   
Married_Premium_Discount                                      0.002825   
Prior_Insurance_Premium_Adjustment                            1.000000   
Claims_Frequency                                             -0.009315   
Claims_Adjustment                                            -0.002384   
Policy_Adjustment                                             0.008717   
Premium_Amount                                                0.234541   
Safe_Driver_Discount                                         -0.009844   
Multi_Policy_Discount                                         0.000924   
Bundling_Discount                                            -0.005321   
Total_Discounts                                              -0.007551   
Time_Since_First_Contact                                      0.016891   
Conversion_Status                                            -0.005617   
Website_Visits                                                0.004869   
Inquiries                                                     0.004887   
Quotes_Requested                                             -0.004358   
Time_to_Conversion                                            0.003907   
Credit_Score                                                  0.005089   
Premium_Adjustment_Credit                                    -0.013828   
Premium_Adjustment_Region                                     0.002057   

                                    Claims_Frequency  Claims_Adjustment  \
Age                                        -0.005683          -0.007991   
Is_Senior                                  -0.010700          -0.012606   
Married_Premium_Discount                    0.014028           0.015524   
Prior_Insurance_Premium_Adjustment         -0.009315          -0.002384   
Claims_Frequency                            1.000000           0.803950   
Claims_Adjustment                           0.803950           1.000000   
Policy_Adjustment                           0.000091          -0.006312   
Premium_Amount                              0.355371           0.439130   
Safe_Driver_Discount                        0.012604           0.012242   
Multi_Policy_Discount                       0.002136          -0.003842   
Bundling_Discount                          -0.013802          -0.018184   
Total_Discounts                             0.002873          -0.003354   
Time_Since_First_Contact                    0.001809          -0.007774   
Conversion_Status                          -0.025537          -0.022603   
Website_Visits                              0.005437           0.005137   
Inquiries                                   0.004283           0.005831   
Quotes_Requested                           -0.010351          -0.009485   
Time_to_Conversion                          0.024032           0.021283   
Credit_Score                                0.002092           0.005915   
Premium_Adjustment_Credit                  -0.008364          -0.014066   
Premium_Adjustment_Region                   0.000093          -0.002175   

                                    Policy_Adjustment  Premium_Amount  \
Age                                          0.006552       -0.029541   
Is_Senior                                    0.007301       -0.016341   
Married_Premium_Discount                    -0.006465        0.291593   
Prior_Insurance_Premium_Adjustment           0.008717        0.234541   
Claims_Frequency                             0.000091        0.355371   
Claims_Adjustment                           -0.006312        0.439130   
Policy_Adjustment                            1.000000        0.663374   
Premium_Amount                               0.663374        1.000000   
Safe_Driver_Discount                        -0.007045       -0.131944   
Multi_Policy_Discount                        0.013864       -0.144375   
Bundling_Discount                           -0.017149       -0.117614   
Total_Discounts                             -0.002247       -0.228695   
Time_Since_First_Contact                     0.007639       -0.001183   
Conversion_Status                           -0.065393       -0.078765   
Website_Visits                               0.016918        0.024758   
Inquiries                                   -0.008277        0.002993   
Quotes_Requested                             0.024137        0.007680   
Time_to_Conversion                           0.062277        0.074710   
Credit_Score                                 0.001990       -0.251238   
Premium_Adjustment_Credit                    0.013203        0.325845   
Premium_Adjustment_Region                    0.006244        0.265795   

                                    Safe_Driver_Discount  \
Age                                            -0.001894   
Is_Senior                                       0.001749   
Married_Premium_Discount                        0.008348   
Prior_Insurance_Premium_Adjustment             -0.009844   
Claims_Frequency                                0.012604   
Claims_Adjustment                               0.012242   
Policy_Adjustment                              -0.007045   
Premium_Amount                                 -0.131944   
Safe_Driver_Discount                            1.000000   
Multi_Policy_Discount                          -0.005373   
Bundling_Discount                              -0.007008   
Total_Discounts                                 0.586817   
Time_Since_First_Contact                        0.004768   
Conversion_Status                              -0.001935   
Website_Visits                                 -0.010924   
Inquiries                                      -0.009681   
Quotes_Requested                                0.004649   
Time_to_Conversion                              0.002936   
Credit_Score                                   -0.017273   
Premium_Adjustment_Credit                       0.009132   
Premium_Adjustment_Region                      -0.010425   

                                    Multi_Policy_Discount  ...  \
Age                                              0.005307  ...   
Is_Senior                                        0.003546  ...   
Married_Premium_Discount                        -0.011159  ...   
Prior_Insurance_Premium_Adjustment               0.000924  ...   
Claims_Frequency                                 0.002136  ...   
Claims_Adjustment                               -0.003842  ...   
Policy_Adjustment                                0.013864  ...   
Premium_Amount                                  -0.144375  ...   
Safe_Driver_Discount                            -0.005373  ...   
Multi_Policy_Discount                            1.000000  ...   
Bundling_Discount                               -0.007740  ...   
Total_Discounts                                  0.676809  ...   
Time_Since_First_Contact                         0.003444  ...   
Conversion_Status                                0.024390  ...   
Website_Visits                                  -0.017842  ...   
Inquiries                                        0.000070  ...   
Quotes_Requested                                 0.002779  ...   
Time_to_Conversion                              -0.023948  ...   
Credit_Score                                    -0.016293  ...   
Premium_Adjustment_Credit                        0.014213  ...   
Premium_Adjustment_Region                        0.000246  ...   

                                    Total_Discounts  Time_Since_First_Contact  \
Age                                        0.010986                  0.012382   
Is_Senior                                  0.012044                  0.008079   
Married_Premium_Discount                  -0.001537                  0.010795   
Prior_Insurance_Premium_Adjustment        -0.007551                  0.016891   
Claims_Frequency                           0.002873                  0.001809   
Claims_Adjustment                         -0.003354                 -0.007774   
Policy_Adjustment                         -0.002247                  0.007639   
Premium_Amount                            -0.228695                 -0.001183   
Safe_Driver_Discount                       0.586817                  0.004768   
Multi_Policy_Discount                      0.676809                  0.003444   
Bundling_Discount                          0.430216                 -0.004614   
Total_Discounts                            1.000000                  0.003155   
Time_Since_First_Contact                   0.003155                  1.000000   
Conversion_Status                          0.020160                 -0.010277   
Website_Visits                            -0.018960                 -0.002829   
Inquiries                                 -0.002342                  0.004623   
Quotes_Requested                           0.002119                  0.007865   
Time_to_Conversion                        -0.018834                  0.010437   
Credit_Score                              -0.025476                  0.013878   
Premium_Adjustment_Credit                  0.016362                 -0.020797   
Premium_Adjustment_Region                 -0.007813                 -0.008584   

                                    Conversion_Status  Website_Visits  \
Age                                          0.010889       -0.026851   
Is_Senior                                    0.014002       -0.013611   
Married_Premium_Discount                    -0.019536        0.022060   
Prior_Insurance_Premium_Adjustment          -0.005617        0.004869   
Claims_Frequency                            -0.025537        0.005437   
Claims_Adjustment                           -0.022603        0.005137   
Policy_Adjustment                           -0.065393        0.016918   
Premium_Amount                              -0.078765        0.024758   
Safe_Driver_Discount                        -0.001935       -0.010924   
Multi_Policy_Discount                        0.024390       -0.017842   
Bundling_Discount                            0.010554       -0.000642   
Total_Discounts                              0.020160       -0.018960   
Time_Since_First_Contact                    -0.010277       -0.002829   
Conversion_Status                            1.000000        0.025315   
Website_Visits                               0.025315        1.000000   
Inquiries                                   -0.007024       -0.002313   
Quotes_Requested                            -0.004983        0.001241   
Time_to_Conversion                          -0.997763       -0.024612   
Credit_Score                                 0.011398       -0.006573   
Premium_Adjustment_Credit                   -0.023969        0.001874   
Premium_Adjustment_Region                   -0.023537       -0.004192   

                                    Inquiries  Quotes_Requested  \
Age                                  0.004740          0.013878   
Is_Senior                            0.002691          0.010677   
Married_Premium_Discount             0.005821          0.005185   
Prior_Insurance_Premium_Adjustment   0.004887         -0.004358   
Claims_Frequency                     0.004283         -0.010351   
Claims_Adjustment                    0.005831         -0.009485   
Policy_Adjustment                   -0.008277          0.024137   
Premium_Amount                       0.002993          0.007680   
Safe_Driver_Discount                -0.009681          0.004649   
Multi_Policy_Discount                0.000070          0.002779   
Bundling_Discount                    0.007635         -0.005777   
Total_Discounts                     -0.002342          0.002119   
Time_Since_First_Contact             0.004623          0.007865   
Conversion_Status                   -0.007024         -0.004983   
Website_Visits                      -0.002313          0.001241   
Inquiries                            1.000000          0.003449   
Quotes_Requested                     0.003449          1.000000   
Time_to_Conversion                   0.007048          0.003608   
Credit_Score                        -0.025562          0.014190   
Premium_Adjustment_Credit            0.006526         -0.018341   
Premium_Adjustment_Region            0.001430          0.007466   

                                    Time_to_Conversion  Credit_Score  \
Age                                          -0.011568      0.002005   
Is_Senior                                    -0.013851     -0.009078   
Married_Premium_Discount                      0.018586     -0.013903   
Prior_Insurance_Premium_Adjustment            0.003907      0.005089   
Claims_Frequency                              0.024032      0.002092   
Claims_Adjustment                             0.021283      0.005915   
Policy_Adjustment                             0.062277      0.001990   
Premium_Amount                                0.074710     -0.251238   
Safe_Driver_Discount                          0.002936     -0.017273   
Multi_Policy_Discount                        -0.023948     -0.016293   
Bundling_Discount                            -0.009576     -0.009299   
Total_Discounts                              -0.018834     -0.025476   
Time_Since_First_Contact                      0.010437      0.013878   
Conversion_Status                            -0.997763      0.011398   
Website_Visits                               -0.024612     -0.006573   
Inquiries                                     0.007048     -0.025562   
Quotes_Requested                              0.003608      0.014190   
Time_to_Conversion                            1.000000     -0.010306   
Credit_Score                                 -0.010306      1.000000   
Premium_Adjustment_Credit                     0.022576     -0.787820   
Premium_Adjustment_Region                     0.023592      0.000910   

                                    Premium_Adjustment_Credit  \
Age                                                 -0.000525   
Is_Senior                                            0.003270   
Married_Premium_Discount                             0.007010   
Prior_Insurance_Premium_Adjustment                  -0.013828   
Claims_Frequency                                    -0.008364   
Claims_Adjustment                                   -0.014066   
Policy_Adjustment                                    0.013203   
Premium_Amount                                       0.325845   
Safe_Driver_Discount                                 0.009132   
Multi_Policy_Discount                                0.014213   
Bundling_Discount                                    0.002794   
Total_Discounts                                      0.016362   
Time_Since_First_Contact                            -0.020797   
Conversion_Status                                   -0.023969   
Website_Visits                                       0.001874   
Inquiries                                            0.006526   
Quotes_Requested                                    -0.018341   
Time_to_Conversion                                   0.022576   
Credit_Score                                        -0.787820   
Premium_Adjustment_Credit                            1.000000   
Premium_Adjustment_Region                            0.001261   

                                    Premium_Adjustment_Region  
Age                                                  0.005704  
Is_Senior                                           -0.011979  
Married_Premium_Discount                            -0.014421  
Prior_Insurance_Premium_Adjustment                   0.002057  
Claims_Frequency                                     0.000093  
Claims_Adjustment                                   -0.002175  
Policy_Adjustment                                    0.006244  
Premium_Amount                                       0.265795  
Safe_Driver_Discount                                -0.010425  
Multi_Policy_Discount                                0.000246  
Bundling_Discount                                   -0.004078  
Total_Discounts                                     -0.007813  
Time_Since_First_Contact                            -0.008584  
Conversion_Status                                   -0.023537  
Website_Visits                                      -0.004192  
Inquiries                                            0.001430  
Quotes_Requested                                     0.007466  
Time_to_Conversion                                   0.023592  
Credit_Score                                         0.000910  
Premium_Adjustment_Credit                            0.001261  
Premium_Adjustment_Region                            1.000000  

[21 rows x 21 columns]
In [91]:
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title("Correlation Matrix")
plt.show()
[Figure: heatmap "Correlation Matrix"]
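
A 21 × 21 matrix is hard to scan for the variables that matter most for one target. One possible way to rank them is sketched below on synthetic stand-in data; the column names `claims`, `unrelated` and `premium` and the helper `rank_correlations` are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd

def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlations with `target`, ordered by absolute strength."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=np.abs, ascending=False)

# Synthetic stand-in data (assumed values, not the insurance file).
rng = np.random.default_rng(0)
n = 500
claims = rng.integers(0, 4, size=n)
toy = pd.DataFrame({
    "claims": claims,
    "unrelated": rng.normal(0, 1, size=n),
    "premium": 2000 + 100 * claims + rng.normal(0, 20, size=n),
})
ranked = rank_correlations(toy, "premium")
print(ranked)
```

On the real data, `rank_correlations(df, 'Premium_Amount')` would list the same values as the `Premium_Amount` column of the matrix above, strongest first.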

Histogram of Claims Frequency Distribution¶

In [94]:
sns.histplot(df.Claims_Frequency, kde=True,
             bins=int(180/5),
             color='darkblue',
             edgecolor='black',
             linewidth=1)
plt.title('Claims distribution')
plt.xlabel('Claims Frequency')
plt.ylabel('Count')
plt.grid(True)
plt.show()
[Figure: histogram with KDE "Claims distribution" (Count by Claims Frequency)]
In [96]:
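# Count policyholders per claim frequency without changing the column's dtype,
# so Claims_Frequency stays numeric for the regression below.
claims_counts = df['Claims_Frequency'].value_counts()
fig, ax = plt.subplots(figsize=(10, 6))
claims_counts.plot(kind='bar', ax=ax, color='skyblue')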
ax.set_xlabel('Claims Frequency')
ax.set_ylabel('Count')
ax.set_title('Claims distribution')
plt.grid(True)
plt.show()
[Figure: bar chart "Claims distribution" (Count by Claims Frequency)]

Table showing the count of accidents categorized by severity¶

In [149]:
# Count claims per severity level; to_frame names the count column explicitly
table = df['Claims_Severity'].value_counts().to_frame(name='count')
print(table)
                 count
Claims_Severity       
Low               7003
Medium            2038
High               959
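
The counts above correspond to the percentage shares quoted in the summary. A short sketch of that arithmetic, rebuilding the column from the printed counts instead of reading the CSV; the helper name `severity_shares` is an illustrative assumption:

```python
import pandas as pd

def severity_shares(severity: pd.Series) -> pd.Series:
    """Return the share of each severity level in percent, to one decimal."""
    return (severity.value_counts(normalize=True) * 100).round(1)

# Rebuild the column from the counts printed in the severity table.
severity = pd.Series(
    ["Low"] * 7003 + ["Medium"] * 2038 + ["High"] * 959,
    name="Claims_Severity",
)
shares = severity_shares(severity)
print(shares)
```

On the notebook's DataFrame, `df['Claims_Severity'].value_counts(normalize=True)` should give the same shares directly.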

Claims categorized by severity - Pie Chart¶

In [155]:
ax = table.plot.pie(
    autopct='%1.1f%%',
    figsize=(8, 8),
    title='Claims categorized by severity - Pie Chart', 
    subplots = True
)
plt.show()

Histogram showing Premium Amount distribution¶

In [99]:
plt.figure(figsize=(10, 6))
plt.hist(df['Premium_Amount'], bins=30, alpha=0.7)
plt.xlabel('Premium Amount')
plt.ylabel('Count')
plt.title('Histogram - Premium Amount')
plt.grid(True)
plt.show()
In [101]:
sns.histplot(df.Premium_Amount, kde=True,
             bins=int(180/5),
             color='darkblue',
             edgecolor='black',
             linewidth=1)
plt.title('Premium Amount distribution')
plt.xlabel('Premium Amount')
plt.ylabel('Count')
plt.grid(True)
plt.show()

Box plot of Premium Amount Distribution - Checking for outliers¶

In [104]:
plt.boxplot(df.Premium_Amount)
plt.title("Boxplot of Premium Amount distribution")
plt.show()
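The box plot flags points beyond its whiskers, which Matplotlib places at 1.5 times the interquartile range by default. To count such points instead of reading them off the figure, the same rule can be applied directly. This is a sketch only: the helper `iqr_outliers` is hypothetical and the `premiums` list is made-up data, not values from the dataset.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return (lower fence, upper fence, boolean outlier mask) using the k * IQR rule."""
    s = pd.Series(values, dtype=float)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return lower, upper, (s < lower) | (s > upper)

# Made-up premiums for illustration (NOT taken from the dataset)
premiums = [2100, 2150, 2200, 2250, 2300, 2350, 2400, 3500]
lower, upper, mask = iqr_outliers(premiums)
print(f"Fences: [{lower}, {upper}], outliers: {int(mask.sum())}")
```

On the real data the call would be `iqr_outliers(df['Premium_Amount'])`.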

Hypothesis 1: H₀ (null hypothesis): Claims Frequency has no effect on the Premium Amount; H₁ (alternative): Claims Frequency affects the Premium Amount¶

Split dataset into train (75%) and test (25%)¶

In [108]:
x = df[['Claims_Frequency']]
y = df['Premium_Amount']  

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

print(f"Training data size: {x_train.shape[0]}")
print(f"Testing data size: {x_test.shape[0]}")
Training data size: 7500
Testing data size: 2500

Training the Regression model¶

In [115]:
model = LinearRegression()

# Train the model using the training set
model.fit(x_train, y_train)
Out[115]:
LinearRegression()
In [117]:
# Make predictions on the test set
y_pred = model.predict(x_test)
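The predictions on the test set are only useful once they are compared with the held-out targets. A minimal sketch of that comparison follows, using synthetic stand-in data so that it runs on its own. The names `x_demo`, `y_demo` and `demo_model` are hypothetical, and the coefficients used to generate the data are assumptions chosen only to resemble the scale of this dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the insurance dataset)
rng = np.random.default_rng(42)
x_demo = rng.integers(0, 5, size=(1000, 1)).astype(float)
y_demo = 2180 + 74 * x_demo[:, 0] + rng.normal(0, 140, size=1000)

# Same 75% / 25% split as above
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.25, random_state=42)

# Fit on the training part only, then score on data the model has not seen
demo_model = LinearRegression().fit(x_tr, y_tr)
y_te_pred = demo_model.predict(x_te)

test_r2 = r2_score(y_te, y_te_pred)
test_rmse = np.sqrt(mean_squared_error(y_te, y_te_pred))
print(f"Test R²: {test_r2:.3f}")
print(f"Test RMSE: {test_rmse:.1f}")
```

With the notebook's own variables, the equivalent calls would be `r2_score(y_test, y_pred)` and `np.sqrt(mean_squared_error(y_test, y_pred))`.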

Checking correlation between Claims Frequency and Premium Amount¶

In [389]:
x = df['Claims_Frequency']
y = df['Premium_Amount']
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Premium_Amount   R-squared:                       0.126
Model:                            OLS   Adj. R-squared:                  0.126
Method:                 Least Squares   F-statistic:                     1445.
Date:                Wed, 07 May 2025   Prob (F-statistic):          1.78e-295
Time:                        16:12:38   Log-Likelihood:                -63521.
No. Observations:               10000   AIC:                         1.270e+05
Df Residuals:                    9998   BIC:                         1.271e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const             2182.9269      1.690   1291.544      0.000    2179.614    2186.240
Claims_Frequency    73.7018      1.939     38.015      0.000      69.901      77.502
==============================================================================
Omnibus:                      123.162   Durbin-Watson:                   1.974
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               81.995
Skew:                          -0.089   Prob(JB):                     1.57e-18
Kurtosis:                       2.593   Cond. No.                         1.94
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
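The summary can be read with a little arithmetic. The sketch below copies the rounded values from the table above and derives the predicted premium for 0 to 3 claims, the t-statistic as coefficient divided by standard error, and the Pearson correlation, which in a regression with a single predictor is the square root of R² carrying the sign of the slope. Small differences from the table are expected because the inputs are rounded.

```python
import math

# Rounded values copied from the OLS summary above
intercept = 2182.9269
slope = 73.7018
std_err = 1.939
r_squared = 0.126

# Predicted premium for each claims count under the fitted line
predicted = {claims: intercept + slope * claims for claims in range(4)}
for claims, premium in predicted.items():
    print(f"{claims} claims -> predicted premium {premium:.2f}")

# t-statistic of the slope and the implied Pearson correlation
t_stat = slope / std_err
pearson_r = math.copysign(math.sqrt(r_squared), slope)
print(f"t = {t_stat:.2f}")
print(f"r = {pearson_r:.3f}")
```

Read this way, each additional claim is associated with a premium about USD 74 higher, while claim frequency alone explains roughly 12.6% of the variance in premiums.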

Testing model accuracy¶

In [119]:
# Prepare data
x = df[['Claims_Frequency']].values
y = df['Premium_Amount'].values

# Fit model
reg = LinearRegression()
reg.fit(x, y)

# Predictions
y_pred = reg.predict(x)

# In-sample metrics (the model is fit and scored on the full dataset)
print(f"R² score: {reg.score(x, y)}")
print(f"Slope: {reg.coef_[0]}")
print(f"Intercept: {reg.intercept_}")
print(f"RMSE: {np.sqrt(mean_squared_error(y, y_pred))}")
R² score: 0.12628866618004875
Slope: 73.70178966854829
Intercept: 2182.926870176798
RMSE: 138.81951368500023
In [122]:
# Predictions
y_pred = reg.predict(x)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, alpha=0.5, label='Actual Data')
plt.plot(x, y_pred, color='red', label='Regression Line')
plt.xlabel('Claims Frequency')
plt.ylabel('Premium Amount')
plt.title('Linear Regression: Claims Frequency vs Premium Amount')
plt.legend()
plt.grid(True)
plt.show()

Hypothesis 2: H₀ (null hypothesis): Credit Score has no effect on the Premium Amount; H₁ (alternative): Credit Score affects the Premium Amount¶

Credit Score distribution¶

In [126]:
sns.histplot(df.Credit_Score, kde=True,
             bins=int(180/5),
             color='darkblue',
             edgecolor='black',
             linewidth=1)
plt.title('Credit Score distribution')
plt.xlabel('Credit Score')
plt.ylabel('Count')
plt.grid(True)
plt.show()

Split dataset into train (75%) and test (25%)¶

In [129]:
x = df[['Credit_Score']]
y = df['Premium_Amount']  

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

print(f"Training data size: {x_train.shape[0]}")
print(f"Testing data size: {x_test.shape[0]}")
Training data size: 7500
Testing data size: 2500

Training the regression model¶

In [132]:
model = sm.OLS(y_train, sm.add_constant(x_train))  # model creation
results = model.fit()  # fitting the model

# Predict:
predictions = results.predict(sm.add_constant(x_test))
In [134]:
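# Fitted line from the Credit_Score model trained above; the earlier y_pred
# belongs to the Claims_Frequency model and must not be reused here
credit_line = results.predict(sm.add_constant(df[['Credit_Score']]))
plt.scatter(df['Credit_Score'], df['Premium_Amount'], alpha=0.5, label='Data')
plt.plot(df['Credit_Score'], credit_line, color='red', label='Regression Line')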
plt.xlabel('Credit_Score')
plt.ylabel('Premium Amount')
plt.title('Credit Score vs Premium Amount')
plt.legend()
plt.show()
In [136]:
x = df['Credit_Score']
y = df['Premium_Amount']
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Premium_Amount   R-squared:                       0.063
Model:                            OLS   Adj. R-squared:                  0.063
Method:                 Least Squares   F-statistic:                     673.6
Date:                Fri, 09 May 2025   Prob (F-statistic):          8.87e-144
Time:                        13:38:50   Log-Likelihood:                -63870.
No. Observations:               10000   AIC:                         1.277e+05
Df Residuals:                    9998   BIC:                         1.278e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         2755.2915     20.691    133.162      0.000    2714.732    2795.851
Credit_Score    -0.7500      0.029    -25.954      0.000      -0.807      -0.693
==============================================================================
Omnibus:                       40.404   Durbin-Watson:                   1.982
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.592
Skew:                           0.122   Prob(JB):                     3.42e-10
Kurtosis:                       3.212   Cond. No.                     1.03e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.03e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
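With a single predictor, the large condition number in note [2] cannot come from correlated predictors. A likely cause is scale: credit scores sit far from zero compared with the constant column. Centering the predictor is one common remedy, and it changes only the intercept, not the slope. The sketch below illustrates the effect with NumPy, assuming the condition number is the ratio of the largest to the smallest singular value of the design matrix. The scores are made-up values on a credit-score-like scale, and `design_condition_number` is a hypothetical helper.

```python
import numpy as np

def design_condition_number(x):
    """Condition number of the design matrix [1, x] (largest / smallest singular value)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.cond(X)

# Made-up scores on a credit-score-like scale (NOT the insurance data)
rng = np.random.default_rng(0)
score = rng.normal(700, 50, size=1000)

raw = design_condition_number(score)
centered = design_condition_number(score - score.mean())
print(f"raw: {raw:.1f}, centered: {centered:.1f}")
```

On the real data this would mean regressing on `df['Credit_Score'] - df['Credit_Score'].mean()`, in which case the intercept becomes the predicted premium at the average credit score.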
In [139]:
# Prepare data
x = df[['Credit_Score']].values
y = df['Premium_Amount'].values

# Fit model
reg = LinearRegression()
reg.fit(x, y)

# Predictions
y_pred = reg.predict(x)

# In-sample metrics (the model is fit and scored on the full dataset)
print(f"R² score: {reg.score(x, y)}")
print(f"Slope: {reg.coef_[0]}")
print(f"Intercept: {reg.intercept_}")
print(f"RMSE: {np.sqrt(mean_squared_error(y, y_pred))}")
R² score: 0.06312071577216283
Slope: -0.7500420247872055
Intercept: 2755.2914663471456
RMSE: 143.75016505043345
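The two single-predictor models can be compared against a baseline that always predicts the mean premium. For an in-sample least-squares fit with an intercept, RMSE equals the standard deviation of the target times the square root of 1 - R², so the baseline error can be recovered from the printed metrics alone. The sketch below uses only the numbers printed in this notebook; both models should recover the same baseline (about 148.5), which also serves as a consistency check.

```python
import math

# Values copied from the metric printouts in this notebook
models = {
    "Claims_Frequency": {"r2": 0.12628866618004875, "rmse": 138.81951368500023},
    "Credit_Score": {"r2": 0.06312071577216283, "rmse": 143.75016505043345},
}

baselines = {}
for name, m in models.items():
    # RMSE = sd(y) * sqrt(1 - R²)  =>  sd(y) = RMSE / sqrt(1 - R²)
    baselines[name] = m["rmse"] / math.sqrt(1 - m["r2"])
    reduction = baselines[name] - m["rmse"]
    print(f"{name}: baseline RMSE {baselines[name]:.2f}, reduction {reduction:.2f}")

# Change in predicted premium for a credit score that is 100 points higher
credit_slope = -0.7500420247872055
per_100_points = 100 * credit_slope
print(f"Per 100 credit-score points: {per_100_points:.1f}")
```

Both predictors are statistically significant, but each one reduces the prediction error only modestly relative to the mean-only baseline, which is in line with the low R² values.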
In [144]:
# Predictions
y_pred = reg.predict(x)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, alpha=0.5, label='Actual Data')
plt.plot(x, y_pred, color='red', label='Regression Line')
plt.xlabel('Credit Score')
plt.ylabel('Premium Amount')
plt.title('Linear Regression: Credit Score vs Premium Amount')
plt.legend()
plt.grid(True)
plt.show()